Show the code
import pandas as pd
import numpy as np
import sqlite3
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)Course DS 250
[Tanner Hamblin]
Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
The following table shows that there are 3 total baseball players records from BYUI ranging from the years 1997 to 2014. the three of them had a combined career length of 30 years and displays the salary for each year the following player played.
# %%
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)
q = '''
SELECT s.salary, s.playerID, c.schoolID, s.yearID, s.teamID
FROM salaries s
LEFT JOIN collegeplaying c
ON s.playerID = c.playerID
WHERE schoolID = 'idbyuid'
ORDER BY salary DESC
'''
' '
results = pd.read_sql_query(q,con)
display(results)| salary | playerID | schoolID | yearID | teamID | |
|---|---|---|---|---|---|
| 0 | 4000000.0 | lindsma01 | idbyuid | 2014 | CHA |
| 1 | 4000000.0 | lindsma01 | idbyuid | 2014 | CHA |
| 2 | 3600000.0 | lindsma01 | idbyuid | 2012 | BAL |
| 3 | 3600000.0 | lindsma01 | idbyuid | 2012 | BAL |
| 4 | 2800000.0 | lindsma01 | idbyuid | 2011 | COL |
| 5 | 2800000.0 | lindsma01 | idbyuid | 2011 | COL |
| 6 | 2300000.0 | lindsma01 | idbyuid | 2013 | CHA |
| 7 | 2300000.0 | lindsma01 | idbyuid | 2013 | CHA |
| 8 | 1625000.0 | lindsma01 | idbyuid | 2010 | HOU |
| 9 | 1625000.0 | lindsma01 | idbyuid | 2010 | HOU |
| 10 | 1025000.0 | stephga01 | idbyuid | 2001 | SLN |
| 11 | 1025000.0 | stephga01 | idbyuid | 2001 | SLN |
| 12 | 900000.0 | stephga01 | idbyuid | 2002 | SLN |
| 13 | 900000.0 | stephga01 | idbyuid | 2002 | SLN |
| 14 | 800000.0 | stephga01 | idbyuid | 2003 | SLN |
| 15 | 800000.0 | stephga01 | idbyuid | 2003 | SLN |
| 16 | 550000.0 | stephga01 | idbyuid | 2000 | SLN |
| 17 | 550000.0 | stephga01 | idbyuid | 2000 | SLN |
| 18 | 410000.0 | lindsma01 | idbyuid | 2009 | FLO |
| 19 | 410000.0 | lindsma01 | idbyuid | 2009 | FLO |
| 20 | 395000.0 | lindsma01 | idbyuid | 2008 | FLO |
| 21 | 395000.0 | lindsma01 | idbyuid | 2008 | FLO |
| 22 | 380000.0 | lindsma01 | idbyuid | 2007 | FLO |
| 23 | 380000.0 | lindsma01 | idbyuid | 2007 | FLO |
| 24 | 215000.0 | stephga01 | idbyuid | 1999 | SLN |
| 25 | 215000.0 | stephga01 | idbyuid | 1999 | SLN |
| 26 | 185000.0 | stephga01 | idbyuid | 1998 | PHI |
| 27 | 185000.0 | stephga01 | idbyuid | 1998 | PHI |
| 28 | 150000.0 | stephga01 | idbyuid | 1997 | PHI |
| 29 | 150000.0 | stephga01 | idbyuid | 1997 | PHI |
This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
a. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
a. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
a. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.
The following table displays the top 5 players with the highest batting average with at least one appearance at bat. This is not a very useful table because a player that batted twice and hit both times would have a perfect AB. It is more fair to make the limit higher for AB appearances.
| playerID | yearID | b | |
|---|---|---|---|
| 0 | aybarma01 | 2001 | 1.0 |
| 1 | birasst01 | 1944 | 1.0 |
| 2 | brideji01 | 1953 | 1.0 |
| 3 | brownha01 | 1951 | 1.0 |
| 4 | brownpe01 | 1894 | 1.0 |
The Following Graph shows the same as the graph before but with more than 10 AB appearances. this is slightly more useful because it weeds out the people with lot AB appearances giving a more accurate representation of players that batted more than just a couple times. You notice the the batting percentage drops significantly.
| playerID | b | |
|---|---|---|
| 0 | nymanny01 | 0.642857 |
| 1 | carsoma01 | 0.636364 |
| 2 | silvech01 | 0.571429 |
| 3 | puccige01 | 0.562500 |
| 4 | applepe01 | 0.545455 |
This graph is the same as the prior with the stipulation that the player needs over 100 AB appearances. Similarly to the last one we see a relation that as we add a higher requirement of at bat appearances the bating average drops.
| playerID | b | |
|---|---|---|
| 0 | hazlebo01 | 0.402985 |
| 1 | daviscu01 | 0.380952 |
| 2 | fishesh01 | 0.374016 |
| 3 | woltery01 | 0.369565 |
| 4 | barnero01 | 0.366576 |
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Lets-Plot to visualize the comparison. What do you learn?
The following displays a graph comparing two teams: Boston Red Stockings and Chicago White Stockings win to loss ratio in the franchises history. We can see from the graph the the Boston Red Stockings historically have been a better team if you’re basing it off strictly their Win Ratio.
q = '''
SELECT name, TeamID, AVG(CAST(W as float)/CAST(G as float)) as Win_Ratio
FROM teams
WHERE TeamID IN ('BS1','CH1')
GROUP BY name, TeamID
'''
results = pd.read_sql_query(q,con)
display(results)
# Plot bar chart of Win_Ratio by Team
ggplot(results, aes(x='name', y='Win_Ratio')) + \
geom_bar(stat = 'identity', width = .5) + \
ggtitle('Historical Win Ratio Comparison') + \
xlab('Team') + \
ylab('Win Ratio')| name | teamID | Win_Ratio | |
|---|---|---|---|
| 0 | Boston Red Stockings | BS1 | 0.754515 |
| 1 | Chicago White Stockings | CH1 | 0.678571 |
Advanced Salary Distribution by Position (with Case Statement):
* Write an SQL query that provides a summary table showing the average salary for each position (e.g., pitcher, catcher, outfielder). Position information can be found in the fielding table in the POS column.
Include the following columns:
* position
* average_salary
* total_players
* highest_salary
* The highest_salary column should display the highest salary ever earned by a player in that position.
* Additionally, create a new column called salary_category using a case statement:
* If the average salary is above $3 million, categorize it as “High Salary.”
* If the average salary is between $2 million and $3 million, categorize it as “Medium Salary.”
* Otherwise, categorize it as “Low Salary.”
* Order the table by average salary in descending order.
**Hint:** Beware, it is common for a player to play multiple positions in a single year. For this analysis, each player’s salary should only be counted toward one position in a given year: the position at which they played the most games that year. This will likely require a (sub-query)[https://docs.data.world/documentation/sql/concepts/advanced/WITH.html].
From the following table we can see that in general the highest paid positions are 1st basemen making over 3 million. 2nd Basemen and Outfielders on average fall in the 2 - 3 million range. And all other positions fall in the low salary range. I thought this was interesting because everyone would assume the pitcher to be a high paying position. I think the data is skewed because if you notice the count pitcher has by far the highest. This lead me to think that there must be some low outliers dragging down average. One thing that could be done Add a having statement by position that filters out all players with lower than a certain threshold of salary.
q = '''
SELECT f.pos, f.G, ROUND(FORMAT(AVG(s.salary),2),2) as avg_salary, Count(*) as Count,
CASE
WHEN AVG(s.salary) > 3000000 THEN "High Salary"
WHEN AVG(s.salary) BETWEEN 2000000 AND 3000000 THEN "Medium Salary"
WHEN AVG(s.salary) < 2000000 THEN "Low Salary"
Else "Make more Money"
END as rank
From fielding f
JOIN(
SELECT yearID, playerID, MAX(G) as max_g
FROM fielding
GROUP BY yearID, playerID
) mt
ON f.yearID = mt.yearID
AND f.playerID = mt.playerID
AND f.G = mt.max_g
JOIN salaries s
ON f.yearID = s.yearID
AND f.playerID = s.playerID
GROUP BY f.pos
'''
results = pd.read_sql_query(q,con)
display(results)| POS | G | avg_salary | Count | rank | |
|---|---|---|---|---|---|
| 0 | 1B | 109 | 3331718.39 | 1725 | High Salary |
| 1 | 2B | 54 | 1795006.55 | 1582 | Low Salary |
| 2 | 3B | 20 | 2306271.80 | 1619 | Medium Salary |
| 3 | C | 12 | 1428857.75 | 2140 | Low Salary |
| 4 | OF | 76 | 2389816.32 | 5185 | Medium Salary |
| 5 | P | 11 | 1939560.84 | 11876 | Low Salary |
| 6 | SS | 99 | 1973128.41 | 1474 | Low Salary |
Advanced Career Longevity and Performance (with Subqueries):
* Calculate the average career length (in years) for players who have played at least **10 games**. Then, identify the top 10 players with the longest careers (based on the number of years they played). Include their:
* playerID
* first_name
* last_name
* career_length
* The career_length should be calculated as the difference between the maximum and minimum yearID for each player.
The Following Graph shows the Top 10 players with the longest careers based on the debut game and their last game played this shows Nick Altrock being the player with the longest career of 35 years.
q = '''
SELECT p.playerID, p.nameFirst, p.nameLast, STRFTIME("%Y",p.finalGame) - STRFTIME("%Y", p.debut) as career_length
FROM people p
JOIN (SELECT playerID, SUM(G) as sg
FROM fielding
GROUP BY playerID
HAVING sg >= 10
) st
ON p.playerID = st.playerID
ORDER BY career_length DESC
LIMIT 10
'''
results = pd.read_sql_query(q,con)
display(results)| playerID | nameFirst | nameLast | career_length | |
|---|---|---|---|---|
| 0 | altroni01 | Nick | Altrock | 35 |
| 1 | orourji01 | Jim | O'Rourke | 32 |
| 2 | minosmi01 | Minnie | Minoso | 31 |
| 3 | olearch01 | Charley | O'Leary | 30 |
| 4 | lathaar01 | Arlie | Latham | 29 |
| 5 | mcguide01 | Deacon | McGuire | 28 |
| 6 | eversjo01 | Johnny | Evers | 27 |
| 7 | jennihu01 | Hughie | Jennings | 27 |
| 8 | ryanno01 | Nolan | Ryan | 27 |
| 9 | streega01 | Gabby | Street | 27 |
This I found interesting because I did the same graph of the Top ten players with the longest career but this time based on their year ID. What was interesting is that between this and the last graph Nick Altrock and Arlie Latham are not on the list when compared by year ID. Why is this? My thought process is what if these two players were on the bench or still included on the roster when they were no longer playing. This is interesting to me because comparing two things that you would assume would return the same data have discrepencies.
q = '''
SELECT p.playerID, p.nameFirst, p.nameLast, may - miy as career_length
FROM people p
JOIN (SELECT playerID, SUM(G) as sg, MAX(yearID) as may, MIN(yearID) as miy
FROM fielding
GROUP BY playerID
HAVING sg >= 10
) st
ON p.playerID = st.playerID
ORDER BY career_length DESC
LIMIT 10
'''
results = pd.read_sql_query(q,con)
display(results)| playerID | nameFirst | nameLast | career_length | |
|---|---|---|---|---|
| 0 | orourji01 | Jim | O'Rourke | 32 |
| 1 | altroni01 | Nick | Altrock | 31 |
| 2 | lathaar01 | Arlie | Latham | 29 |
| 3 | mcguide01 | Deacon | McGuire | 28 |
| 4 | eversjo01 | Johnny | Evers | 27 |
| 5 | jennihu01 | Hughie | Jennings | 27 |
| 6 | ryanno01 | Nolan | Ryan | 27 |
| 7 | streega01 | Gabby | Street | 27 |
| 8 | ansonca01 | Cap | Anson | 26 |
| 9 | johnto01 | Tommy | John | 26 |
| ID | playerID | yearID | stint | teamID | team_ID | lgID | G | G_batting | AB | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | abercda01 | 1871 | 1 | TRO | 8 | NA | 1 | None | 4 | ... | 0 | 0 | 0 | 0 | 0 | None | None | None | None | 0 |
| 1 | 2 | addybo01 | 1871 | 1 | RC1 | 7 | NA | 25 | None | 118 | ... | 13 | 8 | 1 | 4 | 0 | None | None | None | None | 0 |
2 rows × 25 columns